Decision Trees

The authors define decision trees as a way to generate a prediction by carrying out a series of tests on the values of the descriptive features of a query instance, using the answers to those tests to determine the prediction.

Descriptive features are used to make a test at the root node; the outcome of this test determines which child of the root node is visited next. This process continues until a leaf node is reached and a final prediction is made. The preference in building decision tree models is to build a shallow tree, one which arrives at the expected level of accuracy with as few nodes as possible. To determine which descriptive features should be used for which nodes, we need a concrete understanding of Shannon's entropy model.

Shannon's Entropy Model

"Claude Shanon's entropy model defins a computational measure of the impurity of the elements in a set." - Page 123.

A basic way to think about entropy is as the uncertainty of whether you could guess a result if you made a random selection from a set. A set whose items are all the same has zero entropy, because you will always guess the correct result; a set with more unique items has higher entropy, because the chance that you will guess the correct result is lower.

Outcomes with a large probability have low entropy, and outcomes with a low probability have high entropy. The relationship is defined using a logarithmic function: the entropy of a set is H = -Σᵢ pᵢ log₂(pᵢ), where pᵢ is the probability of outcome i.

What is the difference between entropy and probability? Probability measures how likely a single outcome is, while entropy summarizes the uncertainty across an entire distribution of outcomes.

Playing Cards and Entropy

An example from the book, illustrated here in Python, is to calculate the entropy of different scenarios using a deck of cards. The first scenario is trying to select a single random card. The lower the probability, the higher the entropy: below we calculate that the probability of selecting a single random card is .019, and the entropy of this scenario is the negative log base 2 of this probability, which is about 5.7.
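A minimal sketch of that calculation, assuming a standard 52-card deck:

```python
import math

# Probability of selecting one specific card from a 52-card deck
p = 1 / 52                # ≈ 0.019
# Entropy of that outcome: the negative log base 2 of its probability
entropy = -math.log2(p)   # ≈ 5.7 bits

print(round(p, 3), round(entropy, 1))
```

The rarer the outcome, the more bits of information its occurrence carries.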

Information Gain

Information Gain - is defined as a measure of the reduction in the overall entropy of a set of instances that is achieved by testing on a descriptive feature. The book defines a three step process for computing information gain:

1. Compute the entropy of the original dataset with respect to the target feature.
2. For each descriptive feature, split the dataset by that feature and sum the entropies of the resulting partitions, each weighted by the fraction of instances it contains; this is the remaining entropy.
3. Subtract the remaining entropy from the original entropy to get the information gain for that feature.

Pages 131 and 132 break down the formula for iterating through feature sets to get information gain, and are good references to point back to. When computing information gain, the higher the number of bits gained, the more discriminatory the feature is, which is what we want for classification problems.

"A key part of doing this is being able to decide which tests should be included in the sequence and in what order. The information gain model we have developed allows us to decide which test we should add to the sequence next because it enables us to select the best feature to use on a given dataset. In the next section, we introduce the standard algorithm for growing decision trees in this way." - Page 132

Below I've taken the data set that they provide in the chapter and created a pandas dataframe. Using this data I'll walk through information gain in Python.
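A reconstruction of that dataframe; the column names and exact feature values here are my assumptions, chosen so that suspicious words is true for exactly the three SPAM instances, as described later in this walkthrough:

```python
import pandas as pd

# Assumed reconstruction of the chapter's small spam/ham dataset
df = pd.DataFrame({
    "suspicious_words": [True, True, True, False, False, False],
    "unknown_sender":   [False, True, True, True, False, False],
    "contains_images":  [True, False, False, True, False, False],
    "target":           ["spam", "spam", "spam", "ham", "ham", "ham"],
})
```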

First we will want to understand the entropy of the entire set. We have two target classes, SPAM and HAM, so the entropy of the data set with respect to these two target classes is computed as follows:
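With three SPAM and three HAM instances out of six, the computation is:

```python
import math

# Three of six instances are spam, three are ham
p_spam = 3 / 6
p_ham = 3 / 6

# H = -(sum of p * log2(p) over the target levels)
entropy = -(p_spam * math.log2(p_spam) + p_ham * math.log2(p_ham))
print(entropy)  # 1.0 bit: maximal impurity for two equally likely classes
```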

A more succinct way of writing this for any target set in a dataframe would be:
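One way to write it, using `value_counts(normalize=True)` to get the probability of each target level:

```python
import numpy as np
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Shannon entropy (in bits) of the value distribution in a Series."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

targets = pd.Series(["spam"] * 3 + ["ham"] * 3)
print(entropy(targets))  # 1.0
```

This works for any number of target levels, not just two.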

Now that we have the entropy of the entire data set given the two target classes, we can compute the entropy after splitting the data by each descriptive feature.

Say we split the data into a subset looking at only suspicious words: we have three occurrences where suspicious words is true, and all three of those occurrences have a target class of SPAM, none of HAM.

Based on the above information, we have a descriptive feature that can be used to completely determine the target class of SPAM or HAM. This typically wouldn't be the case, but it showcases how you would use information gain to split data sets up by features to determine a target class.
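Putting the steps together, a sketch of the information gain computation (the dataframe is my assumed reconstruction of the chapter's data):

```python
import numpy as np
import pandas as pd

def entropy(series):
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, feature, target="target"):
    """Entropy of the whole set minus the weighted entropy of each partition."""
    total = entropy(df[target])
    rem = sum(len(part) / len(df) * entropy(part[target])
              for _, part in df.groupby(feature))
    return total - rem

# Assumed reconstruction of the chapter's data: suspicious words
# perfectly separates spam from ham
df = pd.DataFrame({
    "suspicious_words": [True, True, True, False, False, False],
    "target": ["spam", "spam", "spam", "ham", "ham", "ham"],
})
print(information_gain(df, "suspicious_words"))  # 1.0: a perfect split
```

Both partitions are pure, so the remaining entropy is 0 and the information gain equals the full 1.0 bit of the original set.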

Iterative Dichotomizer 3 (ID3)

ID3 is an approach for building the shallowest decision tree possible using information gain. It first computes the information gain of each descriptive feature in the training set and selects the best feature to use, the one with the highest information gain. A root node is added to the tree, and the data set is partitioned based on the outcomes of that feature's test. This process is then repeated on each partition, with the selected feature removed from consideration.

The process continues until all the instances in a partition have the same target level, at which point a leaf node is created and labeled with that level.
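A minimal recursive sketch of that algorithm; it assumes categorical features only, and doesn't handle unseen feature values or continuous splits:

```python
import numpy as np
import pandas as pd

def entropy(s):
    p = s.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def id3(df, features, target="target"):
    """Return a nested-dict decision tree: {feature: {value: subtree-or-leaf}}."""
    # Base case: pure partition -> leaf labeled with the single target level
    if df[target].nunique() == 1:
        return df[target].iloc[0]
    # No features left -> leaf labeled with the majority target level
    if not features:
        return df[target].mode()[0]

    # Pick the feature with the highest information gain
    def gain(f):
        rem = sum(len(p) / len(df) * entropy(p[target])
                  for _, p in df.groupby(f))
        return entropy(df[target]) - rem

    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    # Recurse on each partition, with the chosen feature removed
    return {best: {value: id3(part, rest, target)
                   for value, part in df.groupby(best)}}

df = pd.DataFrame({
    "suspicious_words": [True, True, True, False, False, False],
    "target": ["spam", "spam", "spam", "ham", "ham", "ham"],
})
tree = id3(df, ["suspicious_words"])
print(tree)  # {'suspicious_words': {False: 'ham', True: 'spam'}}
```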

Information Gain Ratio

Instead of using information gain as the measure that determines which partitions happen in which order, we can use what is called the information gain ratio, which takes the information gain and divides it by the amount of information needed to determine the value of the feature (the entropy of the feature itself). This corrects a bias in plain information gain toward features that split the data into many small subsets.
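A sketch of the ratio, reusing the entropy helper from earlier; the dataframe is again my assumed reconstruction:

```python
import numpy as np
import pandas as pd

def entropy(s):
    p = s.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def gain_ratio(df, feature, target="target"):
    """Information gain divided by the entropy of the feature's own values."""
    rem = sum(len(p) / len(df) * entropy(p[target])
              for _, p in df.groupby(feature))
    ig = entropy(df[target]) - rem
    split_info = entropy(df[feature])  # information used to encode the split
    return ig / split_info if split_info else 0.0

df = pd.DataFrame({
    "suspicious_words": [True, True, True, False, False, False],
    "target": ["spam", "spam", "spam", "ham", "ham", "ham"],
})
print(gain_ratio(df, "suspicious_words"))  # 1.0
```

A feature with many distinct values has high split information, so its gain ratio is pulled down even when its raw information gain looks large.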

GINI Index

"The GINI index can be understood as calculating how often the target levels of instances in a dataset would be misclassified if predictions were made on the sole basis of the distribution of the target levels in the dataset." - Page 145.

Tree Pruning

"By stopping the partitioning of the data early, however, induction algorithms that use pre-pruning can fail to create the most effective trees because they miss interactions between features that emerge within subtrees that are not obvious when the parent nodes are being considered. Pre-pruning can mean that these useful subtrees are never created." Page - 155.

Model Ensembles

A model ensemble is a model that is composed of a set of models. The defining characteristics of an ensemble model are:

- It builds multiple different models from the same dataset, with each model generalizing in a slightly different way.
- Predictions are made by aggregating the predictions of the individual models.

Bagging

Bagging, or bootstrap aggregating, uses random samples drawn with replacement, called bootstrap samples. One model is induced from each bootstrap sample. This has the effect of creating a different training set for each model. Decision trees are very sensitive to the training data: a different data set can cause a different root node to be chosen, with significant ripple effects down the tree.
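A sketch of drawing one bootstrap sample with numpy; sampling with replacement means some instances repeat while others are left out entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for a training set of 10 instances

# A bootstrap sample: n draws *with replacement* from the n instances
sample = rng.choice(data, size=len(data), replace=True)
print(sample)
```

Each model in the ensemble would be induced from its own sample like this one.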

Models which combine bagging, subspace sampling, and decision trees are known as random forests.

Boosting

Boosting works by focusing on misclassified pieces of data. Gradient boosting iteratively trains prediction models so that later models specialize in areas where earlier models struggled: it focuses on the errors made at each stage, and each added model is trained to predict and reduce the prediction error of the ensemble so far. Typically you end up with a large number of shallow trees; as with a random forest, many weak models are combined.
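A minimal sketch of that idea for regression on a single feature, using depth-1 stumps as the weak learners; the function names and the squared-error setup (where the residuals are exactly the errors each new model targets) are my assumptions, not the book's code:

```python
import numpy as np

def fit_stump(x, residuals):
    """Weak learner: a depth-1 regression stump on a single feature."""
    best = None
    for t in np.unique(x):
        left, right = residuals[x <= t], residuals[x > t]
        if len(left) == 0 or len(right) == 0:
            continue  # skip splits that leave one side empty
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((residuals - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, n_trees=20, lr=0.5):
    """Each stump is fit to the residual errors of the ensemble so far."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)  # train on the current errors
        pred += lr * stump(x)           # shrink the correction and add it
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
pred = gradient_boost(x, y)
print(pred)
```

Each round shrinks the remaining error, so after enough rounds the ensemble's predictions converge toward the targets on this toy data.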